Efficient Document Indexing Using Pivot Tree

Authors

  • Gaurav Singh
  • Benjamin Piwowarski
Abstract

We present a novel method for efficiently searching for the top-k neighbors of documents represented in a high-dimensional term space, based on cosine similarity. Documents are mostly stored in a bag-of-words tf-idf representation. One of the most widely used ways of computing the similarity between a pair of documents is the cosine similarity between their vector representations. However, cosine similarity is not a metric distance measure, as it does not satisfy the triangle inequality, so most metric search methods cannot be applied directly. We propose an efficient method for indexing documents using a pivot tree that leads to efficient retrieval. We also study the relation between precision and efficiency for the proposed method and compare it with a state-of-the-art method in the area of document searching based on inner product.
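The claim that cosine similarity does not give a metric can be checked numerically. The sketch below is illustrative only and is not the pivot-tree method of the paper: it assumes the common conversion "cosine distance = 1 - cosine similarity", uses small dense vectors in place of sparse tf-idf vectors, and adds an exhaustive top-k search as the baseline that any index would try to beat. All function names are hypothetical.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors given as lists of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed conversion from similarity to a distance-like score."""
    return 1.0 - cosine_similarity(u, v)


def violates_triangle_inequality(a, b, c):
    """True if d(a, c) > d(a, b) + d(b, c) under cosine distance."""
    direct = cosine_distance(a, c)
    via_b = cosine_distance(a, b) + cosine_distance(b, c)
    return direct > via_b + 1e-12  # small tolerance for rounding error


def top_k_brute_force(query, docs, k):
    """Indices of the k documents most similar to query, best first.

    Exhaustive baseline: computes the similarity to every document.
    """
    scored = ((cosine_similarity(query, d), i) for i, d in enumerate(docs))
    return [i for _, i in heapq.nlargest(k, scored)]
```

For example, with a = [1, 0], b = [1, 1] and c = [0, 1], the distance from a to c is 1, while the distances from a to b and from b to c are each about 0.293, so the detour through b is shorter than the direct distance and the triangle inequality fails. This is why pruning rules from metric indexes cannot be reused unchanged.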


Similar resources

Pivot Selection Methods Based on Covariance and Correlation for Metric-space Indexing

Metric-space indexing is a general method for similarity queries of complex data. The quality of the index tree is a critical factor in query performance. Bulkloading a metric-space indexing tree can be represented by two recursive steps, pivot selection and data partition, while pivot selection dominates the quality of the index tree. Two heuristics, based on covariance and correlation, for...


Optimal Pivot Selection Method Based on the Partition and the Pruning Effect for Metric Space Indexes

This paper proposes a new method to reduce the cost of nearest neighbor searches in metric spaces. Many similarity search indexes recursively divide a region into subregions by using pivots, and construct a tree-structured index. Most recently developed indexes focus on pruning objects and do not pay much attention to tree balancing. As a result, indexes having imbalanced tree-structure ...


Document Transformation System from Papers to XML Data Based on Pivot XML Document Method

This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hi...


Random Indexing K-tree

Random Indexing (RI) K-tree is the combination of two algorithms for clustering. Many large scale problems exist in document clustering. RI K-tree scales well with large inputs due to its low complexity. It also exhibits features that are useful for managing a changing collection. Furthermore, it solves previous issues with sparse document vectors when using K-tree. The algorithms and data struc...


D-Tree: A Multi-Dimensional Indexing Structure for Constructing Document Warehouses

Document warehouses, unlike traditional document management systems, contain extensive semantic information about documents, cross-document feature relations, and document grouping or clustering, thus providing accurate and efficient access to business intelligence information. Since documents are multi-dimensional in nature, we claim that traditional indexing methods are not really suitable...



Journal:
  • CoRR

Volume abs/1605.06693

Publication date 2016